Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent
Matrix completion, where we wish to recover a low-rank matrix by observing a few of its entries, is a widely studied problem in both theory and practice, with many applications. Most of the provable algorithms for this problem so far have been restricted to the offline setting, where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications, the online version, where we observe one entry at a time and dynamically update our estimate, is more appealing. While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting. In this paper, we propose the first provable, efficient online algorithm for matrix completion. Our algorithm starts from an initial estimate of the matrix and then performs non-convex stochastic gradient descent (SGD). After every observation, it performs a fast update involving only one row of two tall matrices, giving near-linear total runtime. Our algorithm can be naturally used in the offline setting as well, where it gives sample complexity and runtime competitive with state-of-the-art algorithms. Our proofs introduce a general framework to show that SGD updates tend to stay away from saddle surfaces, which could be of broader interest for other non-convex problems.
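The abstract's key efficiency claim is that each observation touches only one row of each of the two tall factor matrices. A minimal sketch of an SGD step in that style (not the paper's exact update; the `p` rescaling and variable names are illustrative assumptions):

```python
import numpy as np

def online_mc_step(U, V, i, j, value, lr, p):
    """One SGD step for matrix completion on the observation M[i, j] = value.

    U: (m, k) and V: (n, k) are the low-rank factors. Only row i of U and
    row j of V are read and written, so each update costs O(k), giving
    near-linear total runtime over all observations.
    p (an assumption here) is the probability of observing each entry,
    used to rescale the stochastic gradient.
    """
    residual = U[i] @ V[j] - value          # error on the observed entry
    grad_u = residual * V[j] / p            # gradient w.r.t. row i of U
    grad_v = residual * U[i] / p            # gradient w.r.t. row j of V
    U[i] = U[i] - lr * grad_u
    V[j] = V[j] - lr * grad_v
    return U, V
```

Starting from a good initialization (as the abstract requires) and streaming observed entries through this update drives the reconstruction error down without ever forming the full matrix.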
Review for NeurIPS paper: Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations
Weaknesses: While the "biased expectation" appears to be a powerful tool, the overall results are restricted to the gradients of the algorithm at _some_ time t in the last T iterates. While this is a common outcome of the standard analysis of SGD, it would be nice if (with some additional assumptions on f) the results could be transposed to f(x_t), or to x_t within some basin of attraction. The special case of s = 0 needs much more detailed treatment. While the authors point out in the supplement that \phi is continuous at s = 0, much of the document switches between looking at s -> 0 or s = 0 without explanation. Assumption 1: I see that the authors need to control \|X_t\|_2 in Thm 1. (Eq.
Review for NeurIPS paper: Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations
After significant discussion, the reviewers were unanimous in their appreciation of the simplicity and cleanliness of the approach presented by the paper. However, the authors are strongly encouraged to improve the presentation of the paper, especially the crucial proof of Lemma 1: multiple steps have been contracted in the presentation, and clarifying them is necessary. Furthermore, the case of the diminishing step-size scheme should be fleshed out in the theory rather than left as a straightforward extension. Lastly, the reviewers suggested using heavier-tailed distributions, such as the Lévy distribution, to better verify the theory.
Reviews: Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent
The main contribution of this work is showing that, in the context of matrix completion, when equipped with a good initialization, the sequence of solutions produced by a widely used (but previously without theoretical guarantees) SGD algorithm converges linearly to the true matrix. The paper reads well and most of the theoretical results look sound, but I have some concerns, as follows. Is it possible to access fewer samples, e.g., O(mu * d * k * log(d)), while still ensuring global convergence? Theorem 3.1 says "for any fixed T >= 1, with probability at least 1 - T/d^{10}, we have linear convergence". That means that if we run the algorithm (i.e., repeatedly reveal samples) for a longer time, we have a LOWER confidence of obtaining the true matrix.
Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations
This work proposes a novel analysis of stochastic gradient descent (SGD) for non-convex and smooth optimization. In the case of sub-Gaussian and centered noise, we prove that, with probability 1-\delta, the number of iterations to reach a precision \varepsilon for the squared gradient norm is O(\varepsilon^{-2}\ln(1/\delta)). In the case of centered and integrable heavy-tailed noise, we show that, while the expectation of the iterates may be infinite, the squared gradient norm still converges with probability 1-\delta in O(\varepsilon^{-p}\delta^{-q}) iterations, where p, q > 2. This result shows that heavy-tailed noise on the gradient slows down the convergence of SGD without preventing it, proving that SGD is robust to gradient noise with unbounded variance, a setting of interest for Deep Learning. In addition, it indicates that choosing a step size proportional to T^{-1/b}, where b is the tail parameter of the noise and T is the number of iterations, leads to the best convergence rates.
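The abstract's practical takeaway is the horizon-dependent step size eta_0 * T^{-1/b}. A toy sketch of that rule, using symmetrized Pareto noise as an illustrative heavy-tailed stand-in (the test function, constants, and noise model are assumptions for illustration, not the paper's setup):

```python
import numpy as np

def sgd_heavy_tailed(grad, x0, T, b, eta0=0.1, seed=0):
    """Run T steps of SGD with the step size eta0 * T**(-1/b) suggested
    by the abstract for noise with tail parameter b.

    The gradient noise is centered, symmetrized Pareto with tail index b,
    so it is heavy-tailed (infinite variance for b < 2) but integrable.
    Returns the last iterate and the smallest squared gradient norm seen,
    the quantity the high-probability bounds control.
    """
    rng = np.random.default_rng(seed)
    eta = eta0 * T ** (-1.0 / b)        # horizon-dependent step size
    x = np.asarray(x0, dtype=float)
    min_sq_grad = np.inf
    for _ in range(T):
        # zero-mean heavy-tailed noise: Pareto magnitude, random sign
        noise = rng.pareto(b, size=x.shape) * rng.choice([-1.0, 1.0], size=x.shape)
        x = x - eta * (grad(x) + noise)
        min_sq_grad = min(min_sq_grad, float(np.sum(grad(x) ** 2)))
    return x, min_sq_grad
```

Because the step size shrinks with T, rare huge noise draws move the iterate only a bounded amount, which matches the abstract's message that heavy tails slow convergence without preventing it.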
Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent
Jin, Chi; Kakade, Sham M.; Netrapalli, Praneeth